A Florida health insurance company wants to predict annual claims for individual clients. The company pulls a random sample of 50 customers. The owner wishes to charge an actuarially fair premium to ensure a normal rate of return. The owner collects all current customers’ health care expenses from the last year and compares them with what is known about each customer’s plan.

The data on the 50 customers in the sample include the following variables:

  • Charges: Total medical expenses for a particular insurance plan (in dollars)
  • Age: Age of the primary beneficiary
  • BMI: Primary beneficiary’s body mass index (kg/m²)
  • Female: Primary beneficiary’s birth sex (0 = Male, 1 = Female)
  • Children: Number of children covered by the health insurance plan (includes other dependents as well)
  • Smoker: Indicator of whether the primary beneficiary is a smoker (0 = non-smoker, 1 = smoker)
  • Cities: Dummy variables for each city, with Sanford as the reference category

Answer the following questions using complete sentences and attach all output, plots, etc. within this report.

For this assignment, ignore the categorical variables (Female, Smoker, and the city dummies).

Question 1

Perform univariate analyses on the quantitative variables (center, shape, spread). Include descriptive statistics and histograms. Be sure to use terms discussed in class such as bimodal, skewed left, etc.

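# Packages used in this report
library(dplyr)      # %>% pipe
library(gtsummary)  # tbl_summary()
library(plotly)     # plot_ly()
library(ggplot2)    # ggplot()
library(ggpubr)     # ggparagraph(), ggarrange()

str(Insurance)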
## tibble [50 × 9] (S3: tbl_df/tbl/data.frame)
##  $ Charges      : num [1:50] 9145 7441 12143 3260 19023 ...
##  $ Age          : num [1:50] 52 45 60 31 39 25 25 57 34 42 ...
##  $ BMI          : num [1:50] 36.7 30.2 25.7 20.4 18.3 ...
##  $ Female       : num [1:50] 0 0 0 0 1 1 1 1 0 0 ...
##  $ Children     : num [1:50] 0 1 0 0 5 1 0 2 1 2 ...
##  $ Smoker       : num [1:50] 0 0 0 0 1 0 1 0 1 0 ...
##  $ WinterSprings: num [1:50] 0 0 0 0 0 0 0 0 0 0 ...
##  $ WinterPark   : num [1:50] 0 0 1 0 0 1 0 0 0 1 ...
##  $ Oviedo       : num [1:50] 1 1 0 1 1 0 1 1 0 0 ...
# Drop the categorical variables, which this assignment ignores
Insurance$Female <- NULL
Insurance$WinterPark <- NULL
Insurance$WinterSprings <- NULL
Insurance$Oviedo <- NULL
Insurance$Smoker <- NULL

# Summary table: mean (SD), median (quartiles), and range for continuous variables
Insurance %>%
  tbl_summary(
    statistic = list(
      all_continuous() ~ c("{mean} ({sd})",
                           "{median} ({p25}, {p75})",
                           "{min}, {max}"),
      all_categorical() ~ "{n} / {N} ({p}%)"
    ),
    type = all_continuous() ~ "continuous2"
  )
Characteristic          N = 50¹
Charges
    Mean (SD)           12,142 (11,317)
    Median (IQR)        8,333 (4,360, 13,720)
    Range               2,494, 55,135
Age
    Mean (SD)           42 (13)
    Median (IQR)        40 (30, 53)
    Range               23, 64
BMI
    Mean (SD)           28.7 (5.6)
    Median (IQR)        28.0 (25.2, 32.2)
    Range               16.8, 42.1
Children
    0                   17 / 50 (34%)
    1                   14 / 50 (28%)
    2                   12 / 50 (24%)
    3                   6 / 50 (12%)
    5                   1 / 50 (2.0%)
¹ n / N (%)
plot_ly(x = Insurance$Age, type = "histogram", alpha = 0.6) %>% 
  layout(title = 'Distribution of Age',
         xaxis = list(title = 'Age of the primary beneficiary'),
         yaxis = list(title = 'Count'))
plot_ly(x = Insurance$Children, type = "histogram", alpha = 0.6) %>% 
  layout(title = 'Distribution of Children',
         xaxis = list(title = 'Number of children covered by health insurance plan (includes other dependents as well)'),
         yaxis = list(title = 'Count'))
plot_ly(x = Insurance$BMI, type = "histogram", alpha = 0.6) %>% 
  layout(title = 'Distribution of BMI',
         xaxis = list(title = 'Primary beneficiary’s body mass index (kg/m²)'),
         yaxis = list(title = 'Count'))
# Refer to columns by bare name inside aes(); aes(Insurance$Age) triggers
# the ggplot2 warning "Use of `Insurance$Age` is discouraged. Use `Age` instead."
ggp1 <- ggplot(Insurance, aes(x = Age)) +
  geom_histogram(binwidth = 2, col = 'black', fill = 'darkblue', alpha = 0.75) +
  labs(title = "Distribution of Primary Beneficiary Age", x = 'Age') + theme_bw()
ggp2 <- ggplot(Insurance, aes(x = BMI)) +
  geom_histogram(binwidth = 2, col = 'black', fill = 'blue', alpha = 0.75) +
  labs(title = "Distribution of Primary Beneficiary BMI", x = 'BMI') + theme_bw()
# Children is a small count (0 to 5), so use one bin per value
ggp3 <- ggplot(Insurance, aes(x = Children)) +
  geom_histogram(binwidth = 1, col = 'black', fill = 'lightblue', alpha = 0.50) +
  labs(title = "Number of children covered by health insurance plan", x = 'Children') + theme_bw()

# Placeholder commentary to display beside each histogram
text1 <- "Text regarding age goes here; break up sentences to make pretty"
text.a <- ggparagraph(text = text1, face = "italic", size = 11, color = "black")

text2 <- "Text regarding BMI goes here; break up to make pretty"
text.b <- ggparagraph(text = text2, face = "italic", size = 11, color = "black")

# Supply both ncol and nrow so all five panels fit on one page; with ncol = 2
# alone, ggarrange() split the panels into a three-page list instead of one figure.
ggarrange(ggp1, text.a, ggp2, text.b, ggp3, ncol = 2, nrow = 3, align = "v", common.legend = TRUE)


Question 2

Perform bivariate analyses on the quantitative variables (direction, strength and form). Describe the linear association between all variables.
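
One way to approach this question is a Pearson correlation matrix (direction and strength) together with pairwise scatterplots (form). The sketch below is illustrative only: to stay self-contained it uses just the first five rows visible in the str() output above, stored in a hypothetical data frame named insurance_head, so its correlations are not the answer for the full sample. With the real data, pass Insurance to cor() and pairs() instead.

```r
# Illustrative subset: the first five rows shown in the str() output above.
# (insurance_head is a stand-in name; use the full Insurance tibble in practice.)
insurance_head <- data.frame(
  Charges  = c(9145, 7441, 12143, 3260, 19023),
  Age      = c(52, 45, 60, 31, 39),
  BMI      = c(36.7, 30.2, 25.7, 20.4, 18.3),
  Children = c(0, 1, 0, 0, 5)
)

# Pearson correlations: the sign gives direction, the magnitude gives strength
cor_matrix <- cor(insurance_head, method = "pearson")
print(round(cor_matrix, 2))

# Pairwise scatterplots help judge form (linear, curved, or no clear pattern)
pairs(insurance_head, main = "Pairwise scatterplots (illustrative subset)")
```

Values of r near +1 or -1 indicate a strong linear association, while values near 0 indicate a weak one; the scatterplots should be checked before describing any association as linear.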

Question 3

Generate a regression equation in the following form:

\[Charges = \beta_{0}+\beta_{1}*Age+\beta_{2}*BMI+\beta_{3}*Children\]

# Fit the multiple linear regression of Charges on Age, BMI, and Children
model <- lm(Charges ~ Age + BMI + Children, data = Insurance)
summary(model)
## 
## Call:
## lm(formula = Charges ~ Age + BMI + Children, data = Insurance)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7795  -5107  -3978  -2406  44936 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -5962.0     8845.5  -0.674  0.50368   
## Age            346.5      118.5   2.925  0.00533 **
## BMI            133.0      277.5   0.479  0.63388   
## Children      -107.3     1308.1  -0.082  0.93497   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10620 on 46 degrees of freedom
## Multiple R-squared:  0.1737, Adjusted R-squared:  0.1198 
## F-statistic: 3.223 on 3 and 46 DF,  p-value: 0.03109

The fitted regression equation is:

\[\widehat{Charges} = -5962.0 + 346.5*Age + 133.0*BMI - 107.3*Children\]

Question 4

An eager insurance representative comes back with a potential client. The client is 40, their BMI is 30, and they have one dependent. Using the regression equation above, predict the amount of medical expenses associated with this policy. (Provide a 95% confidence interval as well)

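# Client profile: age 40, BMI 30, one dependent
newPrediction <- data.frame(Age = 40, BMI = 30, Children = 1)
# 95% confidence interval for the mean charges at this profile
predict(model, newdata = newPrediction, interval = "confidence", level = .95)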
##        fit      lwr      upr
## 1 11782.35 8598.572 14966.13
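
As a rough hand check of the fitted value, the rounded coefficients from the Question 3 summary output can be plugged in directly. Because those coefficients are rounded, the result differs slightly from the 11782.35 that predict() computes from the unrounded estimates.

```latex
\[
\begin{aligned}
\widehat{Charges} &= -5962.0 + 346.5*40 + 133.0*30 - 107.3*1 \\
                  &= -5962.0 + 13860.0 + 3990.0 - 107.3 \\
                  &= 11780.7
\end{aligned}
\]
```

The interval requested with interval = "confidence" describes the mean charges for clients with this profile: roughly 8,599 to 14,966 dollars at the 95% level.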